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Generalized Belief Propagation for Probabilistic Systems 

. FIELD OF THE INVENTION 

This invention relates generally to modeling probabilistic systems, and more 
particularly, to modeling probabilistic systems using belief propagation in a Markov 
5 network. 

BACKGROUND OF THE INVENTION 

m Computer models are frequently used to study the behavior of complex probabilistic 

ljlj systems. When the systems contain many inter-dependent random variables, Markov 

m 

fu networks are often used. In a Markov network, nodes of the network represent the 

™ possible states of a part of the system, and links between the nodes represent 

% statistical dependencies between the possible states of those nodes. 

W By the Hammersly-Clifford theorem, from the study of Markov networks, the 
probability of any set of states at the nodes of the network can be written as the 
product of compatibility functions between clusters of nodes. 

Figure 1 shows a simple network with four nodes labeled a, b, c, and d. The links 
20 between the nodes represent the statistical dependencies between the possible states 
of the nodes. For the case of pairwise probabilistic interactions between nodes of the 
network, the overall joint probability of the system can be expressed as the product 
of compatibility functions for each linked pair of nodes: 

P(Sa, Sb, S* Sd) = (pab(S a , S h ) 0b c (s bf S c ) <{> ca (S c , S a ) <pbd(Sb, S<l)> [1] 
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where (p a t is the compatibility matrix between nodes a and b, s a is a random variable 
describing the state at node a, and similarly for the other nodes and variables. 



Often, Markov networks for practical applications are very large. For example, an 
5 image acquired from a scene by a camera may be represented by a Markov network 
between all small neighboring patches, or even pixels, of the acquired image. 
Similarly, the well known "travelling salesman problem" can map onto a Markov 
network where the maximum probability state corresponds to the shortest path of the 
salesman's route. This network has as many nodes as cities to be visited. In some 

10 Markov networks, the nodes can represent measured input signals, such as visual 

O 

%q input data. Markov models are also extensively used in speech recognition systems. 

LP 

fu To analyze the probabilistic system modeled by a Markov network, one typically 
JTj wants to find the marginal probabilities of certain network variables of interest. (The 
ll* "marginal probability" of a variable signifies the probability of that variable 
ji! ignoring, the state of any other network variable.) For example, it may be useful to 
[M examine the probability of a variable that represents an underlying explanation for 
D some measured data, such as the probability of particular words used to vocalize 

particular speech sounds. To find those probabilities, the Markov network is 
20 marginalized over all the other variables in the network. This gives the probability of 

the variable representing the explanation, given the measured input data values. This 

marginalization is thus a form of inference. 



One may also want to find states of the nodes, which maximize the network 
25 probabilities. For example, for the Markov network corresponding to the travelling 
salesman problem, it is desired to find the state at each node which maximize the 
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probability of the Markov network. These states, which minimize the length of the 
salesman's route, are known as the maximum a posteriori probability (MAP) states,. 



In the example of Figure 1, it is possible to determine the marginal probability P(s a ) 
5 of the variable at node a by summing the random values at nodes b, c, and d: 

P ( S a ) = £ <X , S b )<j) bc (S h , S c )</) ca (S c , S a )<p bd (s b9 S d ).. [2] 

s b>*c> s tl 



In general, especially for large networks, these marginal probabilities are infeasible 
to determine directly. The joint sum over all possible states of all the nodes can be of 
l@j too high a dimension to sum numerically, particularly when the network has closed 
jjjjj loops. 

r-r z 

Jfi Figures 2a-b show examples of Markov networks with many loops for which it is 

^ difficult to find either the marginal probability at a node, or the state of the node 

ljS; which maximizes the overall probability of the Markov network. Both networks are 

h in the form of lattices, which are commonly used to describe the joint probabilities of 

M variables spatially, distributed over two dimensions. Figure 2a shows a rectangular 

fcsp 

lattice, and Figure 2b shows a triangular lattice. These type of lattice networks are 
used to model many systems. 

20 

Techniques to approximate the marginal probabilities for such structures are known, 
but these techniques are typically very slow. Simulated annealing can be used, or 
Gibbs sampling, see Geman et al. "Stochastic relaxation, Gibbs distribution, and the 
Bayesian restoration of images," IEEE Pattern Analysis and Machine Intelligence, 
25 6:721-741, 1984. Another class of approximation techniques are variational methods, 
see Jordan, "Learning in graphical models/' MIT Press, 1998. However, these 
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methods require an appropriate class of variational approximation functions for a 
particular problem. It is not obvious which functions, out of all possible ones, to use 
for the approximations. 

5 For the special case of Markov networks that form chains or trees, there is a local 
message-passing method that calculates the marginal probabilities at each node, see 
Pearl, "Probabilistic reasoning in intelligent systems: networks of plausible 
inference," Morgan Kaufmann, 1988. The later method is now in widespread use, 
and is equivalent to the "forward-backward" and Viterbi methods for solving one 
10 dimensional Hidden Markov Models (HMM), and to Kalman filters and their 
B generalization to trees, see Luettgen et al. in "Efficient multiscale regularization with 
03 applications to the computation of optical flow," IEEE Trans. Image Processing, 

fni 

Sj 3(l):41-64, 1994. This message-passing method gives the exact marginal 

«j probabilities for any Markov network that does not have loops. This is referred to as 

]% the "standard' belief propagation, or message-passing method below. 

iH Unfortunately, many Markov networks of practical interest do contain loops. For 
M example, an image, modeled as a Markov network of local image patches connected 

to their nearest neighbors, gives a lattice structured Markov network as shown in 
20 Figures 2a-b, also called a Markov random field. This type of network contains many 

loops. 

Another method for inference in Markov networks applies the local message-passing 
rules derived for trees and chains in a network, even though the network may contain 
25 loops, see Weiss, "Belief propagation and revision in networks with loops," 

Technical Report 1616, MIT Al Lab, 1997. This is referred to as the "loopy belief 
propagation method in the description below, although it should be clearly 
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understood that the "loopy" method is nothing more than the "standard" belief 
propagation method applied to a network with loops. When such a procedure 
converges, it can yield an approximate determination of the marginal probabilities. 
However, the loopy method sometimes gives too poor an approximation to the 
5 marginal probabilities, and often does not even converge. In the latter case, the 
approximation gives no single answer for the desired marginal probabilities. 

Therefore, it is desired to provide a method for determining marginal probabilities in 
Markov networks that is both relatively fast and more accurate than the loopy 
10 method. Furthermore, it is desired to provide a method for networks with loops that 
y3 converges more reliably than the prior art loopy belief propagation method. 



1*5l The present invention provides a method for determining the probabilities of nodes 
p:: in a network model of a complex probabilistic system. More particularly, the method 
Of determines desired marginal or maximum a posteriori probabilities in networks with 
C loops. The method uses a message-passing scheme, appropriate for networks with 
loops, which is more accurate and typically converges more reliably and in fewer 

20 iterations than prior art loopy belief propagation methods. 

The invention describes a class of procedures in which computational cost and 
accuracy can be traded off against each other, allowing a user of the invention to 
select more computation for more accuracy in a particular application of the 
25 invention. 



Seminary of the Emiveimtiomi 
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The invention has two major advantages over the prior art loopy belief propagation 
method. First, the invented method normally gives much more accurate answers for 
the desired marginal probabilities. Second, the invented method can converge to a 
single answer in cases where the loopy method does not. 



Instead of finding the marginal probability at a node, one embodiment of the 
invention finds the states at each node which approximately maximize the probability 
of the entire network. Thus, the invention provides a novel way to approximate both 
the marginal probabilities and MAP states in Markov networks. 



Many Markov network problems of interest are known to be NP-hard problems. A 

w 

problem is NP-hard when it is intrinsically harder than those problems that can be 
%: solved by a Turing machine in nondeterministic polynomial time. When a decision 
j~ version of a combinatorial optimization problem belongs to the class of NP-complete 
15 problems, which includes the traveling salesman problem described above, an 

5 optimization version is NP-hard. The invented method yields fast, approximate 

O 

fU solutions for some of these very difficult optimization problems. 



More particularly, the invention provides a method for determining marginal 
20 probabilities of states of a system represented by a model. The model includes nodes 
connected by links. Each node represents possible states of a corresponding part of 
the system, and each link represents statistical dependencies between possible states 
of related nodes. 



* > - — • - .. . 

' 2S The nodes are grouped into arbitrary sized clusters such that every node is included 
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associated sets of source nodes and destination nodes, a value for each state of the 
destination nodes and a rule that depends on other messages and on selected links 
Coiinectm^the-SQi irce nodes and de stination nodes. The values of each message are 
updated using the associated rule until a termination condition is reachear 
^5 point the probabilities of the states of the system are determined from the values of 
the messages. 



Brief Description! of the Orawnmigs 
10 Figure 1 is graph of a network with four nodes and links; 
J|: Figure 2a is a graph of a square network; 
Figure 2b is a graph of a triangular network; 



15 

f — ! 

2 Figure 3 is a flow diagram of a method for propagating beliefs in a network 
[U according to a preferred embodiment of the invention; 

Figures 4a-c are graphs of nodes grouped into arbitrary sized intersecting clusters; 

20 

Figure 5 is a graph of a nine node network grouped into four intersecting clusters; 
Figure 6 is a flow diagram of detailed steps of the method according to the invention; 
25 Figure 7 is a graph of a hierarchy of regions; 

Figures 8a-c are graphs of belief propagation for a rectangular network; 
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Figure 9 is a graph of an identity for the belief propagation of Figures 8a-c; 



Figure 10 is a graph of another identity of the belief propagation of Figure 7; 



Figures 1 la-c are graphs of belief propagation for a triangular network; 



Figure 12 is a graph of an identity for the belief propagation of Figures 1 la-c; 



10 Figure 13 is a graph of another identity for the belief propagation of Figures 1 la-c; 



m 

m 
Pi 



m 
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Figure 14 is graph of a nine node network grouped into four clusters; 



Figure 15a is a graph of the network of Figure 14 with the nodes grouped into four 
1£ clusters; 



iy Figure 15b is a graph of a super-node network showing standard message passing; 

o 

Figure 15c is a graph showing standard and normalized message passing; 



Figure 15d is a graph of a network with nine super-nodes; and 

Figure 16 is a flow diagram of an alternative embodiment of the belief propagation 
method according to the invention. 
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Detailed Description of the Preferred Embodiments 
Introduction 

5 Our invention builds on the "cluster variational method" for analyzing cooperative 
phenomena in statistical physics, see for example, the special issue in honor of R. 
Kikuchi, Progress in Theoretical Physics Supplements, vol. 115, 1994. 

A physical system is often modeled in statistical mechanics as a Markov network, 
10 where the states of the nodes in the network correspond to the physical states of the 
2 particles or spins in the physical system. The cluster variational method derives 

approximations to the "Gibbs free energy" of a physical system. We refer to these as 
^; "Kikuchi approximations" to the Gibbs free energy, or the "Kikuchi free energy." 
p From a Kikuchi approximation to the Gibbs free energy, one can determine 
15 properties of a physical system, such as magnetization, specific heat, or the critical 
ff* temperature for a magnetic spin system. 

\ i 'i 

o The simplest Kikuchi approximation derived with the cluster variational method 
corresponds to an approximation first proposed by Bethe, see Bethe, Proc. Roy. Soc. 

20 (London) A 150 (1935) 552. The loopy belief propagation method, when it 

converges, gives marginal probabilities identical to those obtained from the Bethe 
approximation. 

Solutions obtained from propagation rules according to our invention are equivalent 
25 to solving the appropriate physical system under the corresponding Kikuchi 

approximation. The method of minimizing the Kikuchi free energy directly will 
therefore give identical results to our method, but such a direct minimization will be 
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extremely slow and be plagued by the problem of convergence to inappropriate local 
minimal in the free energy surface. Thus, physics calculations made with more 
complicated Kikuchi approximations have typically only been tractable for 
homogeneous models of physical systems. As an advantage, our belief propagation 
5 method is also practical for heterogeneous models of physical systems where the 
nodes of the model have arbitrary topology and interdependencies. 

■ Calculation off margieal probabilities by Kikeclni free energy mimdmisatioe 

10 As shown in Figure 3, we begin with a network 300 representing a system or model. 
J Typically, the network is constructed ahead of time by an "expert" in the art field in 

which the model operates. It is not our job to construct the model; instead, we 
^ ] provide a method for using the model, even if the model is heterogeneous and 



Qj results, pathologies, etc. Here, the goal may be to use the model to make the best 

jj 'z 

[U (most probable) diagnosis given observations made for a particular patient. In an 
o error-correcting coding system, the nodes represent source bits or parity bits, and the 

links represent the required relationships between the bits. Here, the goal may be to 
20 use the model to decode a block of bits that has been corrupted by noise. 

The model includes a plurality of nodes 301. A random variable s at each node takes 
on one of a discrete set of states, or it may take on a continuum of states. The nodes 
are connected to others by links 302 to form the network model 300. Typically, a link 
25 connects two nodes, however higher-order links can connect more than two nodes. 
The network model 300 can represent a magnetic spin system, a medical diagnosis 
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includes many loops. For example, in a medical diagnosis system, the nodes can 
represent clinical findings, physiological and psychological conditions, laboratory 
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system, an image-processing system, a speech recognition system, a genetic 
regulatory network, a decoding system, or other complex real-world systems. 

By definition, for a Markov network with pair-wise statistical dependencies between 
nodes, the probability of a particular assignment of values s at the nodes is given by: 



where the first product runs over all the linked neighboring nodes, i and j. 

The 0 compatibility matrices represent the statistical dependencies between the 
nodes, as represented by the links 302 in the network model 300. For example, in a 
medical diagnosis system, one node might represent a symptom, and the nodes that it 
is linked to are related symptoms or diseases. The numerical values of the 
compatibility matrices <f>, represent the strength of the relation between specific 
nodes. For example, in a medical diagnosis model, some symptoms might be 
strongly correlated with certain diseases, i.e. a strong statistical dependency, while 
other symptoms might only have weak statistical correlation with other symptoms or 
diseases. 

For some systems, it may be desired to represent the statistical dependencies between 
the states of the nodes by compatibility functions ("links") that are higher-order 
tensors. For example, if the states of nodes ij, and k are mutually statistically 
dependent, but those dependencies cannot be expressed in terms of pair-wise 
interactions, then we introduce the tensor $^{s n s } ,s k ) to represent those 

dependencies. The over-all probability of the system would then also include the 
product over all higher-order tensors as well. The method we describe here will also 
apply to systems with such higher-order links between nodes. All that is necessary is 



P(s v s 



[3] 
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to include terms that arise from the higher-order links in the "energy" and "region 
energies" that we describe below. 

The ^functions represent the "evidence" that each node is in any particular one of 
its states before we consider any of the other nodes. While the <p functions will 
normally never change, the ^functions will typically change from one use of the 
model to another. For example, in the case of a medical diagnosis model, the y/i(Si) 
function represents, for a specific patient, the results of a given test. To use the same 
model for a different patient, we use different evidence, represented through the yr 
functions, but we do not change the (p functions, which represent the relationships 
between different symptoms, diseases, tests, etc. 

In equation [3], Z is a normalization constant that insures that the sum of the 
probabilities of all possible states of the overall system is equal to one. 

In order to describe the "Kikuchi free energy" of the model, it is helpful to define 
some additional functions, which are motivated by common conventions from 
statistical physics. We define the "bond strength" Jij(s if Sj) between two nodes i and j 



where T is a parameter called the "temperature." We define the "field" htfsi) at the 
node i by: 



by: 



[4] 



-h,( Si )/T 



[5] 



In terms of these variables, we have: 




e 



-£(s ) ,.s 2> ...,.s A ,)/r 



[6] 



12 



MERL-1272 
Yedidia et al. 



where E is the "energy": 

E{s x ,s 2 ^s N ) = J^J ij (5. , s j ) + £ fc. (j. ) . [7] 



In statistical physics, equation [6], which is equivalent to the original definition [3] of 
the Markov model, is known as "Boltzmann's Law," and the normalization constant 
Z is called the "partition function." The partition function is defined explicitly by 

2" _ g-E(S u S 2 ,...,S N )/T 



[8] 

S X ,S 2 ,:.S N 



Following the conventions of statistical physics, we define a quantity called the 
"Gibbs free energy." The Gibbs free energy G is a function of the probability 
distribution P(s 1 ,s 2 ,...,s N ). By design, G is defined so that if it is minimized as a 
function P, then Boltzmann's Law is automatically obeyed. G is defined by: 

G(P(s x ,s 2 ,...,s N ),y)=u-Ts+ru- 5>o?,,5 2 , ...,*„)]. [9] 

Here, U is the expected value of the energy, also called the "internal energy": 

U = £ E(s ] ,S 7 ,...,S N )P(5, , s 2 s N ) [10] 

and S is the entropy: 

S = ~ J, P(s ] ,s 2 ,...s N )ln(P(s l ,s 2 ,:..s N )). [11] 

The last term in equation [9] is a Lagrange multiplier term designed to enforce the 
constraint that the sum of the probabilities over all states is equal to one. In order to 
insure that this constraint is satisfied, we set the partial derivative of G with respect 
to the Lagrange multiplier y equal to zero. 

If we differentiate the Gibbs free energy G with respect to the probability distribution 
P and the Lagrange multiplier y, and set those partial derivatives equal to zero, then 
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we recover Boltzmann's Law. Thus, to understand physical systems, physicists often 
try to calculate and differentiate the Gibbs free energy. 

For a large system, it is intractable to calculate G exactly, so Bethe, Kikuchi, and 
5 many others have developed approximation schemes. In these schemes, one 

determines exactly the Gibbs free energy over some small clusters of nodes of the 
network, and then approximates the total Gibbs free energy as a sum over many 
small clusters of nodes, correcting for over-counting in regions where the different 
clusters overlap. In the Bethe approximation, the clusters are precisely those pairs of 
10 nodes that are linked in the Markov network, while Kikuchi 's more accurate 

5 r. 

5 approximations, also called the "cluster variational method," allow for clusters of a 
^ larger size. 



J* The cluster variational method errs by missing contributions to the Gibbs free energy 



V5 that arise from high-order correlations between nodes that cannot be described in 
ff} terms of the chosen clusters or intersections of clusters. The larger the size of the 
fU small clusters, the smaller the error. 



In principle, the cluster variational method allows for arbitrary clusters on arbitrary 
20 Markov networks. In practice, however, most previous computations involving 
minimization of a Kikuchi free energy were restricted to symmetrically placed 
clusters over a homogenous network. Computations on inhomogeneous networks 
have been performed, but only for the Bethe approximation, see Morita, Physica 98A 
(1979) 566. 



By "homogenous" networks, we mean networks for which the nodes are arranged in 
some regular repeated pattern, and the interactions between the nodes depends only 
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on their position in that pattern. Many magnetic spin systems are "homogenous" in 
this sense, but most of the other types of systems that are modeled with Markov 
networks, such as the example of a medical diagnosis system, are not homogenous. 

Previous prior art Kikuchi free energy minimization calculations were restricted to 
symmetrically placed clusters over a homogeneous network because, before our 
invention, there was no practical way to quickly minimize a Kikuchi free energy 
defined for arbitrary clusters over an inhomogeneous Markov network of arbitrary 
topology. 

Using the Kikuchi approximate form for the total Gibbs free energy, we can 
approximately calculate the marginal probabilities of the variables at the nodes, and 
also the marginal probabilities of collections of nodes. We extend this technique to 
develop a message-passing method that converges to solutions corresponding to a 
Kikuchi approximation of the Gibbs free energy. A message-passage method, as an 
advantage over the prior art Kikuchi free energy calculations, is very much faster and 
more practical for inhomogeneous networks. 

Kikuchi free emieirgy calcelatiomi for simple example 

We use Figures 4a-c to illustrate our generalized message-based belief propagation 
method. This example network shows how we develop a message-passing method 
that has a corresponding Kikuchi approximation result as a fixed point of the 
messages. That is to say, our message-passing method will converge to a solution 
that is equivalent to the solution of the Kikuchi approximation. 
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Figure 4a is a graph of many nodes grouped into three clusters. The clusters intersect 
each other. The nodes and their links are not shown, and indeed, the number of nodes 
per cluster can be quite large. In real world physical systems, one usually chooses 
many more than three clusters for a Kikuchi approximation. Figures 4b-c depict two 
possible instantiations of the three clusters and their intersections. 



s described above with respect to Figure 3, we group 310 the nodes of the network 
300 into arbitrary intersecting clusters 31 1. A cluster is said to be intersecting if a 
single node appears iniliei^than one cluster. Because we do an arbitrary clustering, 
we can focus on regions of the mocteKh^ might be more significant to an exact 
lution. We require that every node is include94n^at least cluster. We-also require 
;~th-at every link in4he Markov network is completed-inclufred^n at least one cluster. 
The first constraint ensures that our clusters represent and consider'MMiodes of the 
model, and the second constraint ensures that the clusters will intersect. 



Figure 4a shows abstractly three clusters of nodes (regions 7, 2, and J), and their 
intersections (regions 72, 31, and 23), and intersections of intersections (region 123). 

We use the following terminology. The largest regions are referred to as "clusters." 
The clusters, along with their intersections, and the intersections of those 
intersections, and so on, are referred to collectively as "regions." A "sub-region" of 
region r is a region that consists of a proper subset of the nodes in region r. Region r 
is a "super-region" of region s if and only if region s is a sub-region of region r. 

The number of nodes in each cluster can vary, and be different from each other. As 
stated before larger clusters will yield better results than small clusters albeit at a 
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greater computational cost. Note that region 12 is interpreted to include region 123, 
just as region 1 also includes region 123. 

In equation [12] below, we approximate the total Gibbs free energy for the system 
300 by the sum of the free energies of each of the clusters. Then, we include 
correction terms by subtracting the free energy of intersection regions where we 
over-counted the free energy. We may include additional correction terms for the 
over-counted intersections of the intersection regions, etc. In addition, we impose, 
with Lagrange multipliers, the additional constraints that the probabilities of each 
region sum to one, and that the probabilities have the correct marginalization 
relationships with respect to each other. 

We express the Gibbs free energy in terms of the probabilities of each of the regions. 
We let (Xi be a particular assignment of node values within the cluster region i. This is 
what we called s 7 , s 2 , % in equation [3] above. Our notation for region 
intersections is to concatenate the indices of the intersecting regions, so is a 
particular assignment of node values within the region of intersection of regions ij 9 
and k. 



The "region energy" E( (% ) includes those terms in the overall energy that can be 
attributed to nodes or links contained entirely in the region i, when the nodes in the 
region are set to the particular values denoted by a&. Note that if there are higher- 
order tensor "links" in the definition of the system, the region energy will still 
contain those links that are contained 
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The "region probability" P(ad is the marginal probability that the nodes in cluster i 
take on the particular values denoted by a;. In our notation, the normalization 
constraint for region i is ^ P(a t ) = 1 . That is, the sum over the probabilities of all 

possible value assignments to the nodes of region i is one. An analogous equation 
holds for all regions. These are the Lagrange multiplier constraints y in equation [12], 
below. 



The second set of constraints for the Gibbs free energy in equation [12] are the 
marginalization constraints, associated with Lagrange multipliers L The Lagrange 
10^ multipliers impose consistency constraints between the probabilities of the regions 
tfl and their intersecting sub-regions. The probability of a region, marginalized over all 
Ci the nodes in the region not in some intersection region, must equal the marginal 
ry probability of the nodes in that intersection region. In our 

m notation, X „ P( a i ) = P( a u ) , where X„ means the sum over all nodes in 

r a i\j t\ j 

o 

l|i region i that are mot in region j. 

P 

g For the three clusters of nodes shown in Figures 4b-c, we express the Gibbs free 
energy in the Kikuchi approximation as Gjjj- We have three sets ofU-S T terms, 
first a sum over states (% for the cluster region free energies, second the terms 
20 involving ctj, where we subtract over-counted intersection regions, and a third term 
involving a }2 3 where we correct for over-subtracting that intersection region. After 
all these corrections are made, each subset of each region is counted exactly once, as 
it should be. 



25 Gi, 2 ,3 ={U-ST) terms + y constraints + X constraints 
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= X )£(«,) + X P(a 2 )E(a 2 ) +X P(a 3 )£(« 3 ) 

or, Oj a, 

+ ) \n{P{a x )) + Tj j P(a 2 ) ln(P(ar 2 )) + 7^ P(a 3 ) ln(P(a 3 )) 

a x a 2 a 3 

- X P(a l2 )E(a n ) - X ^(« 23 )£(« 23 ) "I ^(« 31 )E(a 3l ) 

flr !2 ar 23 or 31 

- rX P(« 12 ) ln(P(ar I2 )) - P{a n ) ln(P(ar 23 )) - P(a 3l ) ln(P(a 31 )) 

or, 2 ar 23 or 3 i 

+ rX PCOTm )^(«123 ) + 7 X ) W(« 123 )) 



"123 



+ Xl2 



i-X^,) 



«1 



+ r 2 



or 2 



+r 3 



«3 



i-X^) 



«12 



+r 2 



23 



a 23 



i-X^K) 



«31 



123 



P(a 12 )-X^(«,) 



+ X ^2U2 (^12 ) 



P(a 12 )-X^(« 2 ) 



+X ^23(^23) 



+X ; W a 3i) 

«31 



p(a 23 )-X^(a 2 ) 



+ X^\23(« 23 ) 



/ > (« 23 )-X P (° r 3) 



/»«*„)- 



+ X A l\3l(«3 l ) 



/>(<*„)- X^W 



^^2X123(^123) 



X ^23\123^123) 

«123 

+ 5rf^Jl\123(^123) 

<*123 



p{<*™)- X^, 2 ) 



or 12M23 



a« 123 )- X^ 23 ) 



^23X1 23 



^(« 123 )- X^(«3.) 
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One point about the marginalization constraints. One might wonder why we do not 
also have terms likt^^ m ((x m )[P(a m )-^ a P(a } )] which enforce that region 1 



properly marginalizes down to region 123. We omit these terms because they are 
redundant with the marginalization equations already included in Gjjj. 

Marginalization is a linear operator. Of the three possible constraints relating regions 
U (/, and ijk 9 that region i marginalize down to region ij 9 that if marginalize down to 
ijk, and that region / marginalize down to ijk, we only need two, and the third is 
linearly dependent on the other two equations. 

The Kikuchi approximation to the Gibbs free energy can be generalized to the case 
of an arbitrary collection of regions in a relatively straightforward way, but we need 
to establish some additional more notation. Let r denote a region chosen from the set 
of all relevant regions R. Let S(r) denote the set of sub-regions of r and T(r) denote 
the set of super-regions of r. Define a "direct sub-region" of r as a sub-region that 
does not have a super-region that is also a sub-region of r. Thus, in our three cluster 
example, regions 12 and 13 are direct sub-regions of region 7, but region 123 is not. 
Let Sd(r) denote the set of direct sub-regions of r and T d (r) denote the set of direct 
super-regions of r. Associate with each region r a "over-counting number" d r , 
defined recursively by 



The largest regions with no super-regions are defined to have a double-counting 
number of one. In the three-cluster case, we have dj=d 2 =d 3 =l y d] 2 =d 2 3=d 3 j=-l > and 
d]23=l- The over-counting number is needed to insure that each link is ultimately 
counted exactly once. 




[13] 
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The generalization of equation [12] for the Kikuchi approximation to the Gibbs free 
energy for the case of an arbitrary collection of regions is: 



G = X dAU r -TS r ) + y r [l-^P(a r )]+ £ £i(aJ[P(aJ-J%)] 
where U r and S r are the internal energy and entropy of the region r. 



[14] 



We want to find the cluster region probabilities that minimize the Kikuchi free 
energy while obeying the marginalization and normalization constraints. These 
probabilities are the correct marginal probabilities under the Kikuchi approximation. 

To find equations for those probabilities, we differentiate equation [14] with respect 
to each of the region probabilities for each possible assignment of values to a region 
and set the result to zero. The resulting equations, illustrated for the three-cluster 
example of Figure 4a, are: 



Eia^ + T + T\n P(a l ) + y, =^ 12 


(«12 ) + («3l) 


[15] 


E(a 2 ) + T + T In P(a 2 ) + y 2 = l^ 2 (q n ) + A m (a 23 ) 


[16] 


E(a 3 ) + T + T In P(a 3 ) + y 3 = (a 23 ) + A ai (or 3I ) 


[17] 


-E(a l2 )-T-T In P(a l2 ) + y n = 


~^\\\2 ( a \2 ) ~ \ 12 (^12 ) + ^12\123 (^123 ) 


[18] 


-E(a 23 )-T-T In P(a 23 ) + y 23 = 


-/ ^2\23 (^23 ) ~ \ B (#23 ) + ^23\123 (#123 ) 


[19] 


-E(a 3l )-T-T\nP(a 3l ) + y 3i = 


- A\31 (#31 ) ~ \ 31 (#31 ) + ^31\123 (#123 ) 


[20] 


E(a 123 ) + T + T\nP(a l23 ) + y 123 = 


— -^12\I23 (#123 ) — ^ mm (#123 ) — ^31\123 (#123 ) • 


[21] 



Exponentiating both sides and rearranging terms of each equation, we have the 
following set of equations for the marginal posterior probabilities: 
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P(a,)=fc,exp 



P(a 2 )=k 2 zxp 



T 



r 



exp 



4/12(^12) + 4\3l(°3l) 



V 



T 

\ 1 J 



\ f 
exp 



^2/12(^2) + Aj\23fe) 



V 



P(a 3 )=Jc 3 exp 



f-F(fY\\ 



T 

V 1 J 



exp 



^3/23(^23) + ^3\3l(^l) 



P(«i 2 ) = /: 12 exp 



£(<z 12 ) 



exp 



^1/12(^12) +^2x12(^12) ^12x123(^23) 



[22] 



[23] 



[24] 



[25] 



Pte^^fc^exp 



P(a 3l )=k 3l Qxp 



E(a 23 ) 



exp 



^2/23 (**23 ) "^3X23 (°*23 ) ^23\1 23 (°1 23 ) 



^(^l)\ vr .r ^3/3 1 (^3 1 ) + 1 (°3 1 ) _ ^ 1\123 (^23) ^ 



) 



[26] 



[27] 



P(a 123 ) = A; 12 3exp 



exp 



^12\123(^123) ^23\123 (**123 ) ^31\123(^123)^ 



[28] 



where the k's are normalization constants, different for each equation, that normalize 
each marginal probability. Each of these equations is actually a collection of 
equations. Thus, if we focus on equation [22], we get one equation for each 
instantiation of 0Cj. On the right hand side of this equation, an and c^y are completely 
determined by a h because regions 12 and 31 are sub-regions of region 1. 



Of course, we also need to differentiate the Gibb's free energy with respect to the X 
and y Lagrange multipliers. Doing that gives us the desired marginalization and 
normalization conditions for the region probabilities. 
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From Lagrange multipliers to messages 

In the previous section, we derived a set of self-consistent equations for region 
probabilities and Lagrange multipliers by differentiating the Kikuchi approximation 
of the Gibbs free energy. These equations are a solution to the problem of finding 
marginal posterior probabilities at the level of the Kikuchi approximation. 



In our new method, we develop equivalent and more convenient formulations of the 
same solution, in terms of a convergent "message-passing" method. By defining 320 
10 the Lagrange multipliers, see step 320, in terms of another set of quantities, called 
^ "messages," we obtain a different but equivalent set of equations for the cluster 
^ probabilities. 

5 Our method works by treating the self-consistent equations as message update rules 
16 321 that are iterated until a termination condition is reached, for example, a fixed 
Si number of iterations, or the solution converges 330, at which point the marginal 

.SBC?. 

\ssr 

nj posterior probabilities 332 can be determined from the converged messages 340. 
3 Before the first update, we assign initial values 331 to the messages 321 . The initial 

values can be random or special values, e.g., all random positive numbers, or all 
20 ones. 



In general, there are many equivalent sets of message definitions and message-update 
rules that correspond to the same Kikuchi approximation. Some particular message 
definitions and message-update rules, which we describe later, have particularly nice 
25 properties and reduce to the already known ordinary message-passing methods when 
applied to the Bethe approximation. In this section, we want to describe the general 
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properties that should be satisfied by messages so that any particular message- 
passing method is equivalent to the corresponding Kikuchi approximation. 

We have used the marginalization relations to relate region probabilities to 
5 probabilities of their intersection regions. For the three-cluster example, equations 
[22]-[28] describe how the probabilities are related to the region energies, and the 
normalization and marginalization Lagrange multipliers. We ignore the 
normalization Lagrange multipliers (the y 's) for now, knowing that these simply give 
a multiplicative scalar which normalizes the final probabilities. 



Our goal is to express messages in terms of the X Lagrange multipliers. It will turn 
out that the marginalization constraint relations will then give us a set of fixed point 
equations for the messages, which we will interpret as the message update rules 321. 



15, We stack all of the X Lagrange multipliers to create a vector X . In this vector, there is 



yj one X Lagrange multiplier for each linearly independent marginalization constraint, 
fly times the number of different possible variable assignments in the cluster region 

i=s _ 

C3 being marginalized down. For our three cluster example, the vector X includes nine 
sets of Lagrange multipliers: X N2 ( a 12 ), X^i a 31 ), X 2/}2 ( cc 12 ), X 2/23 ( OC23), h m { OC23), h/ 
20 31 ( a 31 ), X nn23 { am), hsmi otm), and X 31/123 ( a ]23 ), which we take to be stacked in that 
order. The total dimensionality of the vector X is D x =2 D 12 + 2 D 23 + 2 D 31 + 3 
D ]23 , where, for example, Dj 2 is the dimensionality of aj 2 . 

In our example, equations [22-28] give the cluster probabilities in terms of the 
25 Lagrange multipliers X . We take the logarithm of all these equations, and define the 
local quantities 



10 
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L(a) = E(a) + TlnP(a) [29] 
for each cluster. Then we summarize these equations by: 

L=AX, [30] 
where L is a vector of the L's in the seven regions taken in the order 1, 2, 3, 12, 23, 
31, and 123, and L has dimensionality: 

D L = D 1 + D 2 + D 3 + D 12 + D 23 + D 3] + D 123 . 



A is a matrix with D L rows and Dx columns, but it has a special block form, which we 
express as: 



A = 





1 


0 


0 


0 


0 


0 


0 


0 


0 


0 


1 


1 


0 


0 


0 


0 


0 


0 


0 


0 


0 


1 


1 


0 


0 


0 


1 


0 


1 


0 


0 


0 


-1 


0 


0 


0 


0 


0 


1 


1 


0 


0 


-1 


0 


0 


1 


0 


0 


0 


1 


0 


0 


-1 



0 0 0 0 0 0-1-1-1 



[31] 



Each entry in this matrix actually denotes a sub-matrix. For example, the 1 in the 
upper left corner denotes a sub-matrix of Dj rows and T>n columns which is equal to 
one when the nodes in an agree with the corresponding nodes in a>, and is equal to 
zero otherwise. This becomes clear in a more specific example. 

If region 1 includes two nodes S a and St, and region 2 includes the single node S b , 
and each node can be in two states that we label "1" and "2," then (Xj can be in any of 
the four states S a =l, S b =l; S a =l, S b =2; S a =2, S b =l; S a =2, S b =2 and a x2 can be any 
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of the two states 5^=7; S b =2, and the sub-matrix 1 in the top left comer of A is 
expressed as: 



fl 


°] 


0 


1 


1 


0 







In general, the self-consistent equations derived from a Kikuchi approximation can 
be summarized in the form of a matrix equation L = AA and a set of normalization 
and marginalization conditions. 

We desire an alternative formulation for our solution in terms of "messages" and 
self-consistent message equations. We still use the same normalization and 
marginalization conditions on the cluster probabilities, but we redefine the Lagrange 
multipliers X in terms of messages. 

We store the logarithms of all the messages between the regions in an ordered vector 
in . To be commensurate with the Lagrange multipliers that we seek to summarize, 
we normally let the dimensionality of m be the same as the dimensionality of X . In 
general, the vector m can have different dimensionality than X , but the vector needs 
to span the same sub-space of constraints on the lattice so that we can re-write the 
Lagrange multiplier % in terms of messages in . 

Thus, for our example, we generate sets of logarithms of messages commensurate 
with I as m M2 {a n ) y m ]/3] (a 3] ), m 2 \i 2 (cci2), m 2 s2 3 (a 23 ) f mj^ 3 (a 2 3) f mw(a 3 i) y 
mi2\i23(ai23)> m 23 u 23 (a ]23 ) 9 and m 3} \i 23 (a 123 ). The vector m stores these messages in 
that order. 
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There is a set of self-consistent equations for the cluster probabilities in terms of the 
messages where the equations have the matrix form: 

L = B m, [32] 

or equivalently: 

A X =B m, [33] 
where B is a "selection" matrix that determines the choice of the message-passing 
method. This matrix is specified below. 



The matrix B and the vector in constitute a valid reformulation of the Kikuchi 
solution as long as the system described by equation [33] has at least one solution 
that can be written in the form of a matrix equation: 

m=C X. [34] 

Any valid reformulation of the Kikuchi solution can be used as the basis for our 
message-passing method. 



To illustrate this idea, we return to our three-cluster example of Figure 4a, for which 
we have already chosen the vector m . We can now choose many different matrices 
B, but we use one that arises from a scheme that we describe in more detail below. 



Recall that the m's defined above are the logarithms of messages, which we denote 
by M's. We now set the marginal posterior probabilities of the regions to obey the set 
of equations: 



P(a } ) = k x exp 



irA 3\31 23U23 



[35] 
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P(a 2 ) = k 2 exp 
P(a 3 ) = & 3 exp 
P(ar 12 ) = /: 12 exp 
P(a 23 ) = k 23 exp 
P(a 3} ) = k 3] exp 



f E(a 2 ) 
T 

E(a 3 ) 



T 

T 

K J 

f E(a 23 ) ^ 
T 

v 1 J 



1\12 (**12 )^3\23 (^23 )^31\123 (^123 ) 
1\31 

(a 3l )M 2X23 (a 23 )M 

23X123 

(a I23 ) 

^1X12 (^12 )^2X12 (^12 )^23\123 (^123 )^31\123 (^123 ) 
^3X23 (^23 )^3\23 (^23 )^12\123 (^123 )^31\123 (^123 ) 



( E{cc 3X )\ 



M i\3 1 (fic 3 , )M 3V3 , (or 3 , ) M , 2U 23 (or, 23 ) M 23N1 23 (ofj 23 ) 



P{a m ) = k n3 exp 



^12X123 (^123 )^23\123 (^123 )^3 1X123 (°M23 ) ' 
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[36] 
[37] 
[38] 
[39] 
[40] 
[41] 



where again the k's are normalization constants that vary for each equation, as for 
equations [22-28] above. 

Taking the logarithm of these equations, and recalling the order of the ra's given 
above, we see that equations [35-41] are equivalent to the choice 



B 



0 


0 


1 


0 


0 


1 


0 


1 


0 


1 


0 


0 


0 


1 


0 


0 


0 


1 


0 


1 


0 


1 


0 


0 


1 


0 


0 


1 


0 


1 


0 


0 


0 


0 


1 


1 


0 


0 


0 


1 


1 


0 


1 


0 


1 


0 


1 


0 


0 


0 


1 


1 


1 


0 


0 


0 


0 


0 


0 


0 


1 


1 


1 



[42] 



The choice of B and in give an equivalent reformulation of the region probabilities 
when there is a solution to the equation: 

m = C X, [43] 
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where 

C = B+A, [44] 
and B + is the pseudo-inverse of B. Of course, we can always compute a C matrix as 
above, but for the matrix to represent a true solution, we must verify that matrix 
5 equation 

BC = A [45] 

is satisfied. 

If a solution to A X - Bin exists for a particular choice of matrix B and message 
10 vectors m , we discard the original Lagrange multipliers X and work entirely with the 

e messages. In this case, we guarantee that the self-consistent equations for the 

ifi 

CO messages will ultimately give the same cluster probabilities as those given by the 

Td Kikuchi approximation. 

15? The next step is to determine the self-consistent equations for the messages. To do 

p this, we need to combine the equations for the cluster probabilities in terms of 

L| messages with the marginalization conditions on the cluster probabilities. In terms of 

k " our example, we use, for instance, the equation P(a X2 ) = ^ P(a x ) along with the 

equations for P(aj) and P(a }2 ) in terms of messages to obtain 
20 M 1U2 M 31U23 = k^J ^'^ Xl^ , [46] 

where k is again a normalization constant. In all, there will be one such equation for 
each marginalization constraint. 

We use the D x equations of the form of equation [46] in our message-passing 
25 method, by interpreting them as message-update rules. In this interpretation, any 
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messages on the right-hand-side of an equation are treated as the messages at 
iteration t, while the messages on the left-hand-side of the equation are the desired 
messages at iteration t+1. We start, for example, with random messages 331, and at 
each iteration replace the "old" messages on the right hand side of the equation with 
"new" messages from the left-hand side. In practice, the messages usually converge 
to a fixed point. When they converge, the fixed point corresponds to the solution in 
which we are interested. 

As for all methods of this type, we can update the messages synchronously or 
asynchronously, and we can update the messages with some linear combination of 
the "old" and "new" messages. Our method is an iterative method for the solution of 
nonlinear equations, and as such, its convergence properties may be improved by 
using standard prior-art techniques for such methods, see "Iterative Methods for 
Linear and Nonlinear Equations", C. T. Kelley, 1995. 

Regarding the normalization constants k, we could continue to set these 
normalization constants by requiring that the total cluster probabilities sum to one. 
However, another procedure that works equally well, and is easy to implement, 
temporarily scales the messages by requiring that the sum of all the components of a 
message to equal a given arbitrary constant while iterating the self-consistent 
equations. Reasonable choices for the constant are one, or the dimensionality of the 
message. Thus, for instance, we can require that: 



After iterating the message equations to convergence, we re-scale the messages to 
guarantee that the region probabilities are properly normalized. 




[47] 
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We have described the messages as being associated with a region r and one of its 
direct sub-regions s. We can equivalently describe the message as going from a set of 
"source" nodes to a set of "destination" nodes. The "source" nodes are those nodes in 
region r that are not in direct sub-region s, while the "destination" nodes are those 
nodes in s. 



To obtain probability functions defined on a subset of the nodes in a region, one 
simply marginalizes the region probability by summing it over those nodes that are in 
the region but not in the subset of interest. 

We can also determine the maximum a posteriori probability values of the nodes in a 
region rather than the region probabilities. For the prior-art "standard" belief 
propagation method, a simple modification suffices to obtain the approximate MAP 
values of the nodes. The necessary modification is to replace all sums over node 
values in the message-update rules with the value of the maximum, i.e., the 
"argmax," over the same node values. The same modification can also be used to 
obtain the MAP values of regions for our method. Alternatively, we can use the 
physics formulation of the Markov network, and set the temperature Tto a very small 
positive constant. In the limit T 0 , we recover region probabilities that are entirely 
dominated by the MAP values of the nodes. 

The message-update equations for some arbitrary B matrix is not, in general, 
convenient to solve. With an arbitrary B matrix, one might find that a message 
between two nodes on one side of the network would depend on messages between 
other nodes that are very far away. This is an undesirable property, both because it 
seems unintuitive, and because it means that solving the message-update equations 
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will involve matrix multiplications with very large matrices. Finding the free energy 
for arbitrary B matrices also involve large matrix multiplications or inversions. 

In order that iterating the message update equations be an efficient operation, we 
desire that the matrix B is sparse and local, i.e., only a few non-zero elements for 
each row or column, and only nearby messages influence each other. 

In the next section, we describe a "canonical" message-passing method that yields 
these desired properties. 

A CaMMrnical method 

In the previous section, we described the properties that must be satisfied by 
messages and their associated message-update rules, which we call the "message- 
passing method," in order that the messages give results equivalent to the Kikuchi 
approximation. In general, there may be many different message-passing methods 
that are all equivalent to the same Kikuchi approximation. In this section, we 
describe how to select one particular message-passing method that we call the 
"canonical" method. 

Our canonical method gives the same message-update rules as the standard loopy 
message-passing method when applied at the level of the Bethe approximation. More 
generally, and as an advantage, our canonical method leads to local message-update 
rules that are relatively easy and convenient to solve because they correspond to a 
sparse B matrix. 
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We start with a Markov network with local compatibility functions % and evidence 
functions % . The overall probability of a configuration of states s at s b , ... is 
expressed as: 



5 where the first product runs over all linked neighbors, i and j. 

Using Figure 5, we illustrate this procedure with a specific example. In this example, 
nine nodes labeled a through /, are arranged in a 3x3 lattice 500. The only statistical 
dependencies or links (p^s if Sj) included are the ones connecting nearest neighbors in 
lfi the lattice. 

j ; Geeeral procectaire 



^ The detailed steps of a general method is described with respect to Figure 6. In step 
Wi 601, a network 600 of nodes and links is grouped 601 into clusters or "super-nodes" 
fy 650 to be used in the approximation. These clusters can have a simple geometric 
p shape, repeated over the lattice. Alternatively, the clusters can be triangles or squares 
of an irregular network, or the clusters can be some other arbitrary choice of node 
clusters best suited for the approximation. The set of clusters must cover all the links 
20 in the Markov network, and the clusters must be small enough so that we are able to 
perform the generalized belief propagation operations described below. 

In our example, we group the network 500 into clusters: (abde), (beef), (degh), and 
(efhi) 501-504 of Figure 5. This set of clusters "covers" the lattice 500. That is, every 
25 link between nodes which influence each other is completely included in at least one 
of the four clusters we have selected. Of course, we can select a different collection 



[48] 
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of clusters which completely covers the lattice. That would just be a different 
Kikuchi approximation. 



As stated before, the size of the clusters is proportional to the accuracy of the 
5 approximation. If we chose a cluster (abcdefghi), i.e., all the nodes in the system, 
then our approximation is exact. While that grouping may be feasible with such a 
small example, in general, we will perform summations whose number grows 
exponentially with the size of the cluster. Therefore, in a practical application, the 
number of nodes in a cluster is limited. 

10 

4* The prior art Bethe approximation is obtained when we select as the clusters all the 
% pairs of nodes that influence each other. In our example, that would mean 

J partitioning the nodes pairwise into clusters (ab), (be), (de), (ef), (gh), (hi), (ad), (be), 

ril 

5 ( c f)> (dg), (eh), and (fi). Because our four clusters are larger than the twelve clusters 

IS of the Bethe approximation, we obtain more accurate results. 

cs 

m In step 602, we identify a collection of relevant regions 651 by recursively 

P 

P generating intersections of clusters, for our example, the regions, (be), (de) (fe), (he), 
and (e). We discard duplicate regions. 

20 

In step 603, we arrange the regions into a top-to-bottom hierarchy 652 of their 
intersections. The hierarchy for our specific example is shown in more detail in 
Figure 7. This hierarchy has the property that each region is connected only to its 
"direct sub-regions" and "direct super-regions." A sub-region s is a "direct sub- 
25 region" of another region r if there are no other regions in our collection that are 
super-regions of s and sub-regions of r. In our example, regions (be) and (de) are 



34 



MERL-1272 
Yedidia et al. 

both direct sub-regions of region (abde) 9 but region (e) is not. Region r is a direct 
super-region of region s if and only if region s is a direct sub-region of region r. 

In step 604, we identify messages M 653 which connect all regions with their direct 
sub-regions. In our example, we would identify messages M abde \b e , M abde \j € , M bce f^ e9 

Mb ce ff e , Mdegh\de> M degh he> ^efhiYe> ^efhMiey M be \e, M de \e, and Af^. 

We can identify these messages using a more compact notation that we illustrate with 
one example: 

M ad s Af T471 

lYA be iYl abde\be * L* ' J 

For this message, we send a message from the upper nodes a and d to the lower 
nodes b and e. We can consider the upper nodes to be the "source" nodes for the 
message, and the lower nodes to be the "destination" nodes. Each of the messages 
between a region and its direct sub-region has dimensionality corresponding to the 
dimensionality of the sub-region. Thus, M a be has the same dimensionality as region 

(be). When we want to emphasize that fact, we express the message as M b *(s b9 s e ) . 

In step 605, we construct a probability function P( a R ) and a local potential <P(a R ) for 
each region R. We construct a vector L out of the quantities L(a) = In P(a) - In ®(a) . 
For our example, we need to construct probabilities P(s a , s b 9 s d , s e ), P(s b , s c 9 s e 9 s f ) 9 
P(s d ,s e ,s g ,s h ) 9 P(s ef s f9 s h9 s i ) 9 P(s b9 s e ) 9 P(s d9 s e ) 9 P(s f9 s e ) 9 P(s h9 s e ) 9 and/>(s e ). The 
local potentials &(a R ) are constructed by combining all the original potentials ^ and 
^ that belong to the region R. Thus, for example, <P(s a ,s b ,s d ,s e ) = (p a b(s a ,Sb) <Pad(s a ,Sd) 

<Pbe( Sb> sj <Pde( Sd, S e ) %( So) W Sb) W Sd) %( S e ), [48] 
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while ®{s b ,s e ) = <t> be {s b ,s e )y/ b {s b )\i/ e (s e ) and 4>(s e ) = i// e (s e ). 

Each probability function has the dimensionality of the region it describes. For 
example, if node b can be in any of three states and node e can be in any of four 
5 states, then P( a be ) is described by 3*4=12 numbers. 

In step 606, we specify the matrix B 656 that determines the selection of the 
message-passing method. This matrix will operate on the vector m created from the 
logarithms of the messages specified in step 604. The matrix B has a block form that 
10 we explained when describing equation [33]. This special block form "hides" the 
!g details of the different possible values of the nodes within each message or region. 

IP* 

Using this block form, the matrix B 655 has N R rows and N M columns, where N R is 
^ I the number of regions, and N M is the number of messages. Each entry in the matrix is 

SI either a 0 matrix or a 1 matrix. We determine the entries of the matrix B by the 

sy 

lfL following sub-step 606 for each entry: 

£ (a) Label the region corresponding to the row of the entry R R , and the region 

[Jj and sub-region of the message corresponding to the column of the entry R c and S c 
Q respectively. By R c \ S c , we denote the set of nodes that are in R c but not in S c . 

(b) If the nodes in S c are the same as the nodes in R R , or if the nodes are a sub- 
20 set of the nodes in R Rj and if all the nodes in R c \ Sc are not in R R , then the entry is a 
T matrix. Otherwise, the entry is a 0 matrix. 

Explained more intuitively, this rule says that every region gets messages that start at 
source nodes that are (CMltsMe of the region and end at destination nodes that are 
25 inside of the region. In our example, assuming the regions are listed in the order 
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(abde), (beef), (degh), (efhi), (be), (de), (fe), (he), (e), and the messages are ordered 
as we listed them above, the matrix B is: 



00101000001 1 

ToooooiooToo 
0T00060TT0T0 
oooToTooTToo 

5=101000000111 

oTooToooToTT 
oooTooToTToT 
oooooToTTTTo 
ooooooooTTTT 



[49] 



v 



j 



We explain one of these entries: the 1 matrix in the first row, third column. The first 
row corresponds to the region (abde), while the third column corresponds to the 
message M c l . Both the nodes c and/ are outside of the region (abde) and both the 

nodes b and e are inside the region (abde), so the conditions for choosing a T matrix 
are met for this entry. 

One important advantage of this rule is that when it is applied to the messages and 
regions obtained from the Bethe approximation, the rule will always result in a 
matrix B that will ultimately yield the standard "loopy" belief propagation. Another 
important advantage is that this rule will normally give back a matrix B that results in 
a system of equations A X = B in that is solvable. Another advantage is that this rule 
will result in a B matrix that is sparse and local. 



In step 607, we exponentiate the equations represented by the matrix equation 
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L = Bin to obtain the equations 656 relating the region probabilities to the messages. 
For our example, we obtain nine equations from the nine rows of the matrix 
equation. As an example, we show how to construct the equation for the first row. 
The first row of the B matrix corresponds to the region (abde). Reading across the 
first row, we see that there are T sub-matrices in the third, fifth, eleventh, and twelfth 
columns, which correspond to the messages M b f e , Mf e , M f e , and M h e . Therefore, the 
equation for the first row is: 



The other equations are constructed in the same way. For example, the equation for 
the fifth row is 

P(s b ,sJ = 0( Sb ,sJM^( Sb , S JM^s b ,sJM d e (s e )M!(s e )M:(s e ). [51] 



In step 608, we express the N m constraint equations on the region probabilities, and 
substitute in equations relating the region probabilities to the messages in order to 
obtain a set of self-consistent equations M c (update rules) for messages 657. 

In our example, we have twelve constraint equations. In the canonical method, there 
are always exactly as many constraint equations as there are messages. One of the 
constraint equations in our example relates P(s a ,Sb,Sd,s e ) to P(Sb,s e )\ 

P(s b ,s e )=^P(s a ,s b ,s d ,s e ). [52] 

Using the results from the previous step, we obtain the self-consistent equation for 
the messages from the marginalization (abde) — » (be) : 

(s b9 s, )M d e (*,) = 2 <t> ab (s a , s b )(t> ad (s a ,s d )<f> be (s b ,s € )y. (s a )y/ d {s d )M? e (s d ,s e ). [53] 
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Eleven other similar equations can be obtained from the constraint equations. From 
the marginalization of region (abde) — > (de) we obtain: 

Mt (s d , s e )M b € (s e ) = X <p ad (s a , s d )(/> ab (s a , )^ (j, , 5, )yr. (5. ) Wh (s b )M* {s d , 5, ) [54] 

5 From {beef) — > : 

( , )Af / (s, ) = X 0* > * c )&/ ( *c > (** ^ Wc ( s c )¥ f )Af J ( J, , ) [55] 

From (fcce/) ->(/<?): 

s c s b 

O From (degh) — ► (de): 
1 1 M^(s d ,s e )M h e {s e )- ^<l>dg( s d> s g )<P gh . (J* . * . K )V* ( , j, ) [57] 

y v* 
ni From (degh) — » (Tie): 

B MjE?(^,*^ [58] 

,KSSS. 

fp From (e/fti) — > (/e): 

| M J , )M * ) = X ^ (,, , )0 hi (s h , )0 fe (s, , s e ) ¥h (s h ) ¥i {s, )Mg (s h , s e ) [59] 

O Va 

15T From (e/7n) —> (he): 

{s h , s e )M/ (s e ) = 5>« (5 A , 5, ty, (J, , 5,. )^ , s t )w f (s f ) ¥i is, )M% (s f , s e ) [60] 

S;S f 

From (be) — > (e): 

^ .* (<0 = S ^ ('» , s e )W„ (h )M£ (s b , s e )M* (s b , s e )M' k (s„ )M c b (s b ) [61] 
From (de) — > (e): 

20 (*) = X f A ( j, , 5, ( j, )A#J* (5, , )Af £ (5, , s e )M ° (s d )M j (s d ) [62] 

From (fe) (e): 
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M e Ue) = ^ fe (s f ,s e ) ¥f (s f )M^(s f ,s e )Mf e ( Sf ,s e )M c f (s f )M i f (s f ) 



[63] 



From (he) — > (e): 



[64] 



In step 609, we solve the self-consistent message equations 657 by iteratively 
updating the values according to the rules, beginning with some initial message 
values 658 and some "evidence" ^659. The initial values 658 can be random or 
special values, e.g., all random positive numbers, or all ones. The "evidence" is given 
by the values of the ^variables. 

For this sub-step, there are no guarantees of convergence, though for many cases in 
practice, such systems of equations will quickly converge. One can continue 
updating the message-update equations until they satisfy some termination condition, 
or alternatively stop after a given number of iterations. 

For our example, in order to obtain some numerical results, we need to specify the 
values of the variables defining the model. We choose the case where all the nodes 
can be in one of two states, and where the local potentials (p^(Si,sj) can be written as: 



where is a bond strength and J is the temperature parameter. This is known as the 
Ising model of statistical physics. We set all the bond strengths between nearest 
neighbors Jy = 1 and the temperature 7=2. We also set the yffai) "evidence" 
potentials to be uniform for all nodes i: 



exp(7, y /r)exp(-7,/r)> 
exp(-7, y /r)exp(7, y /r) ' 



[65] 
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This corresponds to the Ising model in a zero field. 



[66] 



Convergent dynamics are obtained for this model by taking, at each iteration, the 
5 new messages to be an average of the "old" messages and the new computed 

messages. The messages from one node to one node are determined first, and then 
these results are used to determine the two node to two node messages. In general, 
one can order the message-update equations so that the messages that depend on the 
fewest nodes are solved first, so that those results can be used on the left-hand-side 
10 of the equations for the later messages. 

Ui 

vp Next, we solve for the region marginal probabilities 660 using the converged values 

fy 

m of the messages and the equations 656 previously derived. We insure that the region 
probabilities are normalized so that they sum to one. 

i 

y Interestingly, for our example, the final messages obtained from a "run" depends on 
J2 the initial messages. However, the final converged region probabilities derived from 
the final messages always are the same. This reflects a general phenomena, which is 
the fact that when there are more messages than regions, the messages form an over- 
20 complete representation of the region probabilities. 



The numerical results obtained using our four cluster Kikuchi approximation are 
excellent and much better than those obtained from the prior art pairwise Bethe 
approximation. 

25 
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To take one example, the exactly computed value of the probability function P(sb,s e ) 
is, to six significant digits: 

f0.406496 0.093504 ^ 
0.093504 0.406496 



Our approximation yields: 

P(s b ,s e ) = 



[66] 



f 0.406659 0.09334 O 
0.093341 0.406659 



[67] 



while the result from standard belief propagation, equivalent to using the Bethe 
approximation, is: 

f 0.365529 0.1 3447 P 
0.134471 0.365529 



[68] 



Thus, our generalized belief propagation method has eliminated more than 99.6 
percent of the error associated with the prior art "standard" belief propagation in this 
case. This is a significant improvement over the prior art. 



Calculating other numerical quantities gives similar improvements for our example. 
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Determining the KikecM free energy 

So far, we have described a process for generating message-passing methods. We 
have not yet mentioned the Kikuchi free energy calculations that motivated our work. 
In fact, our method as described above has practical use by itself. On the other hand, 
for some applications, our method may be of interest to determined the Kikuchi free 
energy, as well as other physical quantities, such as the internal energy, the entropy, 
the specific heat, magnetizations, susceptibilities, and possible critical temperatures 
that can be derived from the Kikuchi free energy. 

The Kikuchi approximation to the Gibbs free energy includes the following parts: the 
internal energy (enthalpy) U, entropic term -TS, and Lagrange multiplier terms 
constraining the normalization and marginalizations of the region probabilities. The 
full formula for the Kikuchi approximation to the Gibbs free energy is given in 
equation [14]. 

For the internal energy and entropy parts, one must remember to weight each 
region's contribution by its over-counting number d r . After the message-passing 
method has converged, the Lagrange multiplier terms all sum to zero and can be 
ignored, because the corresponding constraints are satisfied. 

Application of the canonical generalized belief propagation method to some 
important Markov networks 

It may be helpful if we explicitly express the equations obtained using the described 
canonical method for some common Markov networks. We consider a square lattice 
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with clusters consisting of all the four-node squares in the lattice, and a triangular 
lattice with clusters consisting of all the three-node triangles in the lattice. 



Sqeare lattice 

5 

In a square lattice of nodes each node is connected to its four nearest neighbors. 
Figure 2(a) is an example of a square lattice. This is an important Markov network 
for image processing or computer vision applications. Using each possible four-node 
loop as a cluster, we derive a set of message-passing rules which improve the 
10 accuracy of the computed marginal probabilities over the Bethe approximation's 
<? performance. To obtain improved performance, we introduce double-indexed 
W messages, i.e., arrays of numbers, in addition to the single-indexed messages, i.e., 
CP vectors that are part of the Bethe approximation message-passing. 

* y 

15 Following the implementation rules described above, we derive the message-passing 

,»==. 

et rules, and marginal probabilities in terms of those messages. These, we list below. 

ij We list only the marginalization and propagation equations for the regions away 

o 

h from the edges of the array of nodes. 

20 The regions obtained from the intersections of the original square clusters are regions 
consisting of two nodes connected by a link, and regions consisting of a single node. 



The marginal probability of a region consisting of a single node is expressed as: 

P(S a ) = M:M b a M c a M d a¥a , [73] 

25 where P(S a ) is the marginal probability for each of the different possible states S a9 of 
the random variable at node a, and Af a is the converged value of the message from 
source node e to destination node a, and similarly for the messages from the other 
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nodes neighboring node a. Each of those messages is a vector of the dimensionality 
of the number of different possible states at node a. To avoid cluttering these 
formulae, we have not suppressed the functional dependence of the messages; to 
make it explicit, one would write for example M d a (S a ) . Figure 8(a) is a graphical 

representation of equation [73]. Each arrow symbolizes a single-indexed message. 

The marginal probability of a region consisting of two linked nodes is expressed as: 

^S b ) = MlMlMiMiMiMiMiM^^fj/f, , [74] 
where the termS^cedefined analogously as with equation [73], and, for example, 
M c ab means the double-irtde^ed message from source nodes ch to destination nodes 

ab. Here, the value <p^ is the compatibility matrix between nodes a and b, and is a 
function of S a and S b . Figure 8(b) is a graphicM^i^gresentation of Equation [74]. An 
arrow dragging a line symbolizes a double-indexed mes§agQ^and a simple line 
connected two nodes symbolizes a compatibility matrix. 

The marginal probability of a square region is expressed as: 

P(S a ,S b ,S r S e )=M:M d a M*M h b M k f M l f MW [75] 

with the notation defined as for the equations above. Figure 8(c) is a graphical 
representation of Equation [75]. 

By marginalizing equation [74] over St and setting the result equal to equation [73], 
we obtain one of the two sets of self-consistent message equations that are satisfied 
by the converged values of the messages: 

M b a = MlMZM k b M*M*<pjp b . [76] 



Figure 9 is a graphical representation of equation [76]. 
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By marginalizing equation [75] over S e and S f , and setting the result equal to equation 
[74], we obtain the other set of self-consistent message equations: 

* >,er _ M )M l f M i e MjM ] *M§ M^ a (p ae (p ef (p fh xif e \if s 

ab ~ M e Q M[ ' Un 

For the values of M e a and M{ in equation [77] , we use the values determined from 
equation [76]. Figure 10 is a graphical representation of equation [77]. 

lS Triamgelar lattice 

^ A triangular lattice is another Markov network of particular interest for modeling two 

W dimensional datasets. Figure 2(b) is an example of a triangular lattice. Analogously 

s with the equations and notation as described above, we described the following. 
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The marginal probability of a single-node region is expressed as: 

P(S a ) = M h a M c a M i a MlM i a M l aWa . [78] 
Figure 1 1(a) is a graphical representation of equation [78]. 



20 The marginal probability of a region consisting of two linked nodes is expressed as: 

P{S a ,S e ) = M^M:MtM[M[MtMiMlM h XXaeKe(p a e¥ a ¥e- [79] 
Figure 1 1(b) is a graphical representation of equation [79]. 

The marginal probability of a triangle region is expressed as: 
25 P(S a , S e , 5,) = M' a M b a M c a M d a M^M{MfM^MfM/M \M\M l JA'JA'frjp t ffijw# l . [80] 
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Figure 11(c) is a graphical representation of equation [80]. 
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By marginalizing equation [79] over S e9 and setting the result equal to [78], we obtain 
the self-consistent message equation: 

5 Ml = MtMiMtM h e M[MtMae<Pae¥e [8 1 ] 

Figure 12 is a graphical representation of equation [81]. 

By marginalizing equation [80] over S h and setting the result equal to equation [79], 

we obtain the self-consistent message equation: 

10 M i = M l k MjM I M l ai M^ i (p ai (p ei \i/ i 

O " KK 

•j'r 5 

m Figure 13 is a graphical representation of equation [82]. As described above for the 

s v, a 

m square lattice, for the values of M l a and M\ in equation [82], we use the values 

irf \ 
% *^ 

s determined from equation [81] during the same iteration. 



Modified Loopy Seper-iniode Method 



We now describe an alternative embodiment of our invention. We call this 
embodiment the "modified loopy super-node method." This method is partially 
20 motivated by Pearl's "clustering method", and the related "junction tree method" 
which are prior-art methods that has been employed to obtain exact answers for 
Markov networks with only a few loops, see "Learning in Graphical Models", edited 
by Mi. Jordan, 1998. 

25 In the Junctionjrge^ equivalent to the 

" ■ ■ . — 

original one by grouping the original nodes into "super-nodes." New ^compatibility 
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matrices and ^evidence functions are also generated for the super-nodes such that 
iheprobability of any state using the super-node formulation is equivalent to its 



the standard belief propagation method on a super-node network that was designed to 
have no loops is called the "junction tree method." 

Of course, if the super-node network still has loops, one is still faced with a difficult 
problem which standard belief propagation will not solve exactly. Nevertheless, one 
might consider the obvious "naive" method of using ordinary loopy belief 
propagation on the super-node network. This "naive" method, which one might call 
the "loopy super-node method," sometimes gives acceptable results, but empirical 
testing shows that it also can give very poor results for some simple networks. 

In our alternative embodiment of our invention, we make a modification to the loopy 
super-node method wfiictl^uafantegsj^ when it converges. Our modified method 
also gives results that are equivalent to using aKilalcIirappFOx clusters 
corresponding to the chosen super-nodes. Thus, our "modified loopy super-node*^- 
method" is an alternative method that obtains the same results as our "canonical" 
method described above. 

For concreteness, we illustrate the modified loopy super-node method using the same 
specific Markov network model and choice of clusters that we used to illustrate the 
"canonical" method, see Figure 14. This network consists of nine nodes (a through 
0, and we choose four arbitrary clusters (abde), (beef), (degh), and (efhi). These four 
clusters are also referred to as "super-nodes." 
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Constructing a super-node network 

We begin by reformulating the average energy U. The Kikuchi approximation for the 
average energy is in fact exactly correct when we use the correct marginal 
probabilities. For our example network, the average energy is expressed within the 
Kikuchi approximation as: 

U = I Pia^ )E(a Me ) + £ P(a bcef )E(a bcef ) + X p «*** ) E &** > +X P «V > 

a abde a bcef a degh a efh\ 

-X/»("*)£(a*)-X/>( a *^ 

<*bt a de <*fe a ke <*e 

This expression includes terms that sum over the intersection regions (be), (de), (fe), 
(he), and (e). As for the canonical method above, we want to eliminate these terms 
and have an expression that only contains sums over the super-node regions. We can 
easily do that because each of the sums over the intersection regions can be 
expressed as a sum over a super-node region. For example, 

X P{a be )E(a be ) = X p «*** ) • [84] 

a bt a abde 

This equation is true because of the marginalization formulae for P(a be ): 

X PCa^ )E(a be ) = X S P&** ) [85] 

a abde a be a ad 



= I,E(a be )J,P(oc abde ) [86] 
= ^E(a be )P(a be ). [87] 



Therefore, the Kikuchi average energy can be expressed as: 
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U = I P(oc Me )E(a Me ) + £ P(a fcce/ )f (or^ ) + £ P(a Afl )f(a Afl ) + J )£(a^ ) . [88] 

a abde a bctf a <kgh a iflii 



The E terms are not uniquely determined. They depend on how we choose to 
distribute the intersection region terms. One choice is: 

5 E(a abde ) = E( a<lbde ) - E(a be ) + E(a e ) [89] 

E(a bcef ) = E( abcef )-E(a ef ) [90] 
E{a ghde ) = E{a ghde )-E{a de ) [91] 
E(a eJhi ) = E(a eflti )-E(a he ). [92] 

lfi Our goal is to construct a Markov network between the super-nodes that is equivalent 

JJ? to the original Markov network. Let us label the four super-nodes by the numbers 7, 

J 2, 3, and 4, see Figure 15a. We construct the evidence functions using the equation, 

tU for super-node 7. 

1 WW = [93] 

lfp and similarly for the other super-nodes. 

JSSS. 

The compatibility matrices between the super-node regions are chosen so that if two 
super-nodes are in consistent states, then the compatibility function is one, otherwise 
the compatibility function is zero. For example, if super-node 7 is in a first "super- 
20 node" state such that node b is in some first "node" state, and super-node 2 is in a 
second "super-node" state such that node b is in a second "node" state, than the two 
super-nodes are in inconsistent "super-node" states. The compatibility 
matrix <p ]2 (a x ,a 2 ) should equal zero for those particular states. 
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If we define the evidence vectors and compatibility matrices as described, then the 
joint probability for super-node states is: 



(n and m are super-node indices) will exactly agree with the joint probability P 
defined for the original nodes. 

We summarize our construction in Figure 15a. We define a new "super-node" graph 
1500 that has four super-nodes 1501-1504: 1 = (abde) , 2 = (beef) , 3 = (degh) , and 
4 = (efhi) . We denote the local "evidence potentials" for the four super-nodes by y/ . 
We set the pairwise connections (links) ^ 1511-1514 between the super-nodes so 
that super-nodes must agree on shared nodes. By construction, the super-node 
network 15a has the same probability distribution as the original network of Figure 



In order that our super-node network is consistent with the Kikuchi approximation to 
the Gibbs free energy, we also must enforce all the constraints between the regions 
involved in the Kikuchi approximation. We aim for a minimal set of constraints 
(Lagrange multipliers). Note that in the canonical method, we used an over-complete 
set of constraints. The following set of marginalization constraints are sufficient for 
our example network: 



[941 



n mn 



14. 




[95] 



^P{oc bcef ) = P{cc ef ) = ^P(a efhi ) 



[96] 




[97] 
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X^C***) = p (a de ) = 2 P «W [98] 
^P(aJ = P(a e ), [99] 
in addition to the standard normalization constraints on the super-node regions. 

With these choices, the Kikuchi free energy for this example network takes the 
following form: 

G = £ P{a x )E(a } ) + 2 P(a 2 )E(a 2 ) + ^ P(a 3 )E(a 3 ) P(a 4 )E(a 4 ) 



«3 



a 2 



-T 



X/ > (a fa )lnP(a fa ) + Xm <fc )lnP(a & ) + X^) ln ^ a A) + Z p ( a * e ) lnP ( flr *«) 



+ 



r£P(flf e )lnP(a e ) 





+ r 2 


\-^P(a 2 ) 


+ 73 


l-5>(tf 3 ) 

«3 


+ r 4 


l-X^(«4) 

#4 



+24\*(«*) 



+2^^) 

+ 2^v(a«) 



^)-2 p («i) 

Pia^-^Pia,) 

a \\de 

P{a fe )-J d P{a 2 ) 

<*2\ft 

P(aJ-^P(a 3 ) 



+ 24n*< a *) 

+2^( a *) 



^)-2 p ( a 2) 

Pia^-J^Pia,) 



J °W 



+2 ; M ar P 



+2 ;l 4\ fe K e ) 



^(« /£ )-2 p («4) 



^)-2 p (° r 4) 



#4^ 



P{a e )-^P{ abe ) 



[100] 
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When we differentiate the Kikuchi free energy with respect to the super-node region 
probabilities, we obtain: 

P{a x ) = k l y/ l (a, (a be )\ de (a de ) [101] 

P(a 2 ) = k^iaj^iaj^ia^ [102] 
P(a 3 ) = k&iajAwiaM^iaJ [103] 

P(fl 4 ) = ^4K)4/e(^)4,K) ■ [104] 

Note that these equations are essentially identical to those that would be obtained 
from the naive loopy belief propagation method on the super-node graph. For 
example, using the naive loopy belief propagation method, the marginal probability 
at super-node 1 would be: 

P(a l ) = kvM)M?(a l )M?(a x ), [105] 
where M?(a x ) is the message from source super-node 2 to destination super-node L 
At first sight, there is a difference with equation [101] because M?(a x ) seems to 
depend on the state of all four nodes (abde). In fact, when we take the compatibility 
matrices into account, we finds that in the naive loopy belief propagation method, the 
messages between super-nodes only depend on the states of the nodes that they 
share. Thus we can make the identifications M x (a be ) = \^ e {a be ) , M x (a de ) = ^ ye (a de ) , 

M l 2 (aJ = Z Mie (a be ) 9 M\{a fe ) = ^ fe (a f€ ) , M\(a de ) = Z,, de (a de ) , M^(a he ) = Z^ he (a he ), 
Ml(a fe ) = X 4Ve (a fe ) , and M\{a he ) = ^ he (a he ) . 

In Figure 15b,we draw graphically these messages passed in the naive loopy belief 
propagation method on the super-node graph. 
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When we substitute the region probabilities from equations [101-104] into the 
constraint equations [95-98], we obtain: 

[106] 

a \\be a l\bt 

*i X V\ «*i )A>\be(aMue«xJ = * 3 Yw^W^MM^J [107] 



a l\de a 3\de 

^YlWA^&jAnft&fi) = k 4 ^V 4 ((XM4\fe(<Xfe)t<4\He(<X>,e) [108] 

a 2\fe a 4\fe 

^WiW^iaJ^iaJ = k 4 ^ip 4 (a 4 )A 4Ve (a fe )A 4Vle (a he ) . [109] 



These equations will be satisfied when we make the identification between the 
m Lagrange multipliers X and the loopy belief propagation messages indicated 
lQi previously, and require that the messages obey the standard propagation equations. 

lij Take for example, equation [106]. We can simplify this as follows: 

Q ^M^J^^A/^)^ [110] 

1;5| But if we satisfy the standard belief propagation equations on the super-node graph 

Ufa J = *5> 2 (« 2 )M 2 V /e ) [l 1 1] 



20 



Mliaj = k^MMf(a de ) , [1 12] 

a \\be 

then equation [110] will be satisfied. Thus equations [106-109] are equivalent to the 
naive loopy belief propagation equations for the super-node network. 

However, there exist additional equations that need to be considered in the Kikuchi 
formalism that give rise to a modification of the naive loopy belief propagation 
method. In particular, when we differentiate the Kikuchi approximation to the Gibbs 
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free energy with respect to P(a de ) , P(cc fe ) , P(a he ) > Pi^) , and P(a e ) , we obtain the 
equations: 

P(a*) = k*^(**)***(<x*) [113] 
/>(ar /e ) - * A ^(flf A )^(^) [114] 

P«*J = [1 15] 

^iJ = *^(^)^(^e)^w(«J [1 I 6 ] 

PW = kJ^(a € ). [117] 

The first three of these equations are satisfied automatically when we identify the /l's 
with belief propagation messages as before. For example, for equation [1 13], we 
have: 

P«**) = = k^(a^M?(aJMf(a de ) = k x M x \a de )^ij/ x (a x )M x 2 {a be ) 

Wde Wde Wde [118] 

= k x Ml{a de )M\{a de ) = Mn*(«J>^(«*) 

However, equations [1 16-1 17] are not necessarily satisfied when we take the /l's to 
have the same values as the belief propagation messages. There is another condition, 

which can be obtained from the constraint equation P(a e ) = ^P{cc be ) . Using this 

a b 

equation, we find: 

T^e(<X b e)^e(<X b e) = k, [119] 

where k is some constant. As usual, the actual value of the constant is not important; 
what is important in this case is the fact that this sum should not depend on the state 
of node e. We can guarantee that all the derivative equations [106-109] and [113- 
1 19] are satisfied if we identify the A's with loopy belief propagation messages, but 
make sure that they satisfy the additional constraint of equation [119]. 
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We now define a "normalization" operator. In Figures 15c and 15d, the 
normalization operator is indicated by ellipses 1530, 1541-1544. The normalization 
operator takes as input the messages M?(a be ) and M\{a he ) and returns as output 

normalized messages Norm(M?(a be )) and Norm{M\{a be )) defined by: 

Norm(M?(a be )) = . M J (a ^ , [120] 



Norm(M! 2 (a be ))= . M } (aJ i [121] 



By construction: 

' Norm(M? (a be )) Norm(M l 2 (a be )) = 1 , [122] 



so if we identify the A's with normalized belief propagation messages, equation [119] 
;L will be satisfied. The normalization constraint [119] can be satisfied simultaneously 

|j with the ordinary belief propagation constraints implicit in equations [106-109], if 

Q 

[U we follow the modified loopy junction tree method (see figure 15c). For this example 
lj network, that means that we iteratively solve the belief-propagation equations for the 
super-node lattice, taking care at every iteration to apply the normalization operator 
to messages M?(a be )and M\{a be ) . 

Applying this method to our example network yields results identical to those 
20 obtained from the canonical method, as expected. The results are significantly more 
accurate than those obtained from the "naive" loopy super-node method. 
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We have described the modified loopy super-node method for our example network. 
We now describe how to use this method for an arbitrary network with arbitrary 
5 groupings of nodes into super-nodes. A general version of the method includes the 
following steps as shown in Figure 16. 

In step 1601, group all of the nodes of the network 1600 into a set of intersecting 
clusters 1650. As stated above, the grouping can be heterogeneous and arbitrary. As 
10 an advantage of the arbitrary clustering according to the invention, the user can trade 
*S accuracy for computational complexity. Larger clusters give better results, but take 
^ more time to resolve. Groups of nodes that are less relevant can be grouped into 
ff; smaller clusters. 

15 In step 1602, determine a minimal number of marginalization constraints 1651 that 
fi need to be satisfied between the clusters 1650. 

Fi 5 

Q 

p In step 1603, construct a super-node network 1652. In the network 1652, each cluster 
of nodes is represented by just one super-node 1653. Super-nodes that share one of 
20 the marginalization constraints determined in step 1602 are connected by a super-link 
1654. 



In step 1604, the super-node network 1652 is searched to locate closed loops of 
super-nodes that contain at least one common node. For each such closed loop, 
25 determine a normalization operator 1655. 
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In step 1605, update the messages between super-nodes using standard belief 
propagation. 

In step 1606, replace the messages by their normalized values using the 
5 corresponding normalization operator, and repeat step 1605 until convergence in step 
1607. 



10 



If we have a fixed point of the modified loopy super-node method, then it is 
equivalent to the result obtained from minimizing the Kikuchi free energy. 

Application to Decoding Eiror-Coirectiinig Codes 



ry 



20 



25 



An "error-correcting" code is an encoding of digital, e.g., binary, messages. The code 
adds redmi3aTTcy4haUs designed to protect the transmission of the messages from 
corruption by noise. TypicallyTtrTe~-s«iuj^rof a message transmits "blocks" of bits 
(binary digits), where each block contains the "datebTts^^-ofjhej original message, 
plus some additional "check bits" which help to decode the messagetftris-arriyes at 
a receiver in a corrupted form. For example, a check bit may encode the parity of the 
sum of selected data bits. 

Markov networks have been widely used to represent error-correcting codes, see 
"Good Error-Correcting Codes based on Very Sparse Matrices", by D. MacKay, 
IEEE Transactions on Information Theory, March, 1999. In the Markov network 
formulation of an error-correcting code, some of the nodes of the Markov network 
will correspond to the data bits and the check bits, and the links between the nodes 
will enforce the appropriate relationships between data bits and check bits. 
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The standard decoding algorithm used for some error-correcting codes is the loopy 
belief propagation method, using either marginal or MAP probabilities, run on the 
Markov network that corresponds to the error-correcting code, see R. McEliece, D. 
MacKay, and J. Cheng, IEEE J. Selected Areas in Communication, 1997. When the 
MAP version of belief propagation is used, the decoded states of the bits are just the 
MAP states of the corresponding nodes. When the marginal probability version of 
belief propagation is used, the decoded states of the bits are taken to be those states 
of the corresponding nodes that have the highest marginal probability. 



10 In decoding according to our invention, the loopy belief propagation method is 
l| replaced by the method according to the invention. Our method provides improved 
Co decoding for error-correcting codes. 

to 

m Other Applications 

ill 

« The belief propagation method as described herein can be used in a number of 
jj; applications. It can be used in computer vision problems where scenes need to be 
Q estimated, see U.S. Patent Application Sn. 09/203,108, "Estimating Scenes Using 
Statistical Properties of Images and Scenes," filed by Freeman et al. on November 
20 30, 1998, also see U.S. Patent Application Sn. 09/236,839, "Estimating Targets 

Using Statistical Properties of Observations of Known Targets," filed by Freeman et 
al. on January 25, 1999. It can also be used in object recognition systems such as the 
air-to-ground targeting system described in U.S. Patent No. 5,963,653, "Hierarchical 
information fusion object recognition system and method," issued to McNary et al. 
25 on October 5, 1999. A number of other applications that are well suited for the 
invention are mentioned in U.S. Patent 5,802,256, "Generating improved belief 
networks," issued to Heckerman, et al. on September 1, 1998, for example, crop 
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production estimation, program debugging, coastal ocean environments prediction, 
diagnosing linear lightwave networks. Speech recognition applications that can use 
the present invention are described in U.S. Patent 5,623,609, "Computer system and 
computer-implemented process for phonology-based automatic speech recognition," 
issued to Kaye et al. on April 22, 1997. 

In this description of the invention, we used specific terms and examples. It is to be 
understood that various other adaptations and modifications may be made within the 
spirit and scope of the invention. Therefore, it is the object of the appended claims to 
cover all such variations and modifications as come within the true spirit and scope 
of the invention. 
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